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Drug discovery is the process of identifying compounds which 
have potentially meaningful biological activity. A major challenge 
that arises is that the number of compounds to search over can be 
quite large, sometimes numbering in the millions, making experimen- 
tal testing intractable. For this reason computational methods are 
employed to filter out those compounds which do not exhibit strong 
biological activity. This filtering step, also called virtual screening re- 
duces the search space, allowing for the remaining compounds to be 
experimentally tested. 

In this paper we propose several novel approaches to the problem 
of virtual screening based on Canonical Correlation Analysis (CCA) 
and on a kernel-based extension. Spectral learning ideas motivate our 
proposed new method called Indefinite Kernel CCA (IKCCA). We 
show the strong performance of this approach both for a toy prob- 
lem as well as using real world data with dramatic improvements in 
predictive accuracy of virtual screening over an existing methodology. 

1. Introduction. Computer-Aided Drug Discovery (CADD) is an area 
of research that is concerned with the identification of chemical compounds 
that are likely to possess specific biological activity, that is, the ability to 
bind certain target biomolecules such as proteins. CADD approaches are em- 
ployed in order to prioritize molecules in commercially available chemical li- 
braries for experimental biological screening. The prioritization of molecules 
is critical since these libraries frequently contain many millions of molecules 
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making experimental testing intractable. The process of using computational 
methods to filter out those compounds which are not expected to exhibit 
strong biological activity is called virtual screening. 

Computational methods have been used extensively to assist in experi- 
mental drug discovery studies. In general, there are two major computational 
drug discovery approaches, ligand based and structure based. The former is 
used when the three-dimensional structure of the drug target is unknown 
but the information about a reasonably large number of organic molecules 
active against a specific set of targets is available. In this case, the available 
data can be studied using cheminfomatic approaches such as Quantitative 
Structure-Activity Relationship (QSAR) modeling [for a review of QSAR 
methods see A. Tropsha, in Abraham (2003)]. In contrast, the structure- 
based methods rely on the knowledge of three-dimensional structure of the 
target protein, especially its active site; this data can be obtained from ex- 
perimental structure elucidation methods such as X-ray or Nuclear Magnetic 
Resonance (NMR) or from modeling of protein three-dimensional structure. 

Virtual screening is one of the most popular structure-based CADD ap- 
proaches where, typically, three-dimensional protein structures are used to 
discover small molecules that fit into the active site (a process referred to 
as docking) and have high predicted binding affinity (scoring). Traditional 
docking protocols and scoring functions rely on explicitly defined three- 
dimensional coordinates and standard definitions of atom types of both 
receptors and ligands. Albeit reasonably accurate in some cases, structure- 
based virtual screening approaches are for the most part computationally in- 
efficient [Warren et al. (2006)]. As a result of computational inefficiency there 
is a limit to the number of compounds which can reasonably be screened by 
these methods. Furthermore, recent extensive studies into the comparative 
accuracy of multiple available scoring functions suggest that accurate predic- 
tion of binding orientations and affinities of receptor-ligand pairs remains 
a formidable challenge [Kitchen et al. (2004)]. Yet millions of compounds 
in available chemical databases and billions of compounds in synthetically 
feasible chemical libraries are available for virtual screening calling for the 
development of approaches that are both fast and accurate in their ability to 
identify a small number of viable and experimentally testable computational 
hits. 

Recently, we introduced a novel structure-based cheminformatic workflow 
to search for Complimentary Ligands Based on Receptor Information (CoL- 
iBRI) [Oloff et al. (2006)]. This novel computational drug discovery strategy 
combines the strengths of both structure-based and ligand-based approaches 
while attempting to surpass their individual shortcomings. In this approach, 
we extract the structure of the binding pocket from the protein and then 
represent both the receptor active site and its corresponding ligand in the 
same universal, multidimensional chemical descriptor space (note that in 
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principle, the descriptors used for receptors and ligands do not have to be 
the same, and we will be exploring the use of different descriptor types in 
future studies). We reasoned that mapping of both binding pockets and cor- 
responding ligands onto the same multidimensional chemistry space would 
preserve the complementary relationships between binding sites and their re- 
spective ligands. Thus, we expect that ligands binding to similar active sites 
are also similar. In cheminformatics applications, the similarity is described 
quantitatively using one of the conventional metrics, such as Manhattan or 
Euclidean distance in multidimensional descriptor space. Thus, the chief hy- 
pothesis in CoLiBRI is that the relative location of a novel binding site with 
respect to other binding sites in multidimensional chemistry space could be 
used to predict the location of the ligand(s) complementary to this site in the 
ligand chemistry space. After generation of descriptors, the dataset is split 
into training and test sets and then variable selection is carried out to gen- 
erate models optimizing this complementarity between the binding pocket 
and ligand spaces. These models are then applied to a binding pocket in 
a protein of interest to generate a predicted virtual ligand point which is 
used as a query in chemical similarity searches to identify putative ligands 
of the receptor in available chemical databases. In this paper, we build upon 
the work of Oloff et al. (2006) to develop a substantially more advanced and 
efficient version of CoLiBRI. 

The problem can be generally stated as follows: for a set of n known 
protein-ligand pairs, with dx and dy descriptors, respectively, given a new 
protein we want to be able to predict what ligand(s) will bind to it. Two 
virtual drug screens will be used as a benchmark for testing the methods 
discussed and developed here: 

(1) A set of 800 chemically and functionally diverse protein-ligand pairs 
obtained from the Protein Data Bank (PDB) database on experimentally 
measured binding affinity (PDBbind) [Wang et al. (2004)]. These compounds 
are described by a set of 150 chemical descriptors. These descriptors include 
information related to the electronic attributes, hydrophobicity and steric 
properties of the compounds. For a more detailed discussion on the different 
types of chemical descriptors, see Todeschini and Consonni (2009a, 2009b). 
We will refer to this data set as the 800 Receptor-Ligand pairs (RLP800) 
data. Results and further details on this and two additional data sets can 
be found in Section 1.1. 

(2) The World Drug Index (WDI) [Daylight (2004)] database which con- 
tains approximately 54,000 drug candidates (ligands). Each compound in 
the WDI is described by the same set of 150 chemical descriptors as the 
RLP800 data. 

The accuracy of our prediction is based on how close, in Euclidean dis- 
tance, our prediction is to the actual ligand. This is then compared against 
the distances of all of the ligands in the space to the actual ligand. A stan- 
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Fig. 1. (A) A histogram showing the IKCCA ranks, ri, resulting from prediction on the 
test data from the RLP800 dataset. (B) Performance on the WDI data. 



dard measure of predictive accuracy used in the QSAR literature [Tropsha 
(2003), Oloff et al. (2006)] is based on ranking these distances, from smallest 
to largest. Defining to be the rank of our prediction of test ligand i, model 
performance is defined as the average rank over each of the new points we 
are trying to predict, f = X^™=i r i- This criterion reflects the average size 
of the search space needed to find each compound. Here ny denotes the 
number of new (i.e., test) ligands we are predicting. 

The f effectiveness of the methods studied and developed here is illus- 
trated in Figure 1. Figure 1(A) is a histogram of the ranks, rj for our novel 
method which is a variant of Canonical Correlation Analysis (CCA) we call 
Indefinite Kernel CCA (IKCCA) (Section 4), on the RLP800 data. 

The previous state of the art for these data sets is toloff (the vertical 
line furthest to the right labelled Oloff et al.) which are larger by a factor 
of 5 to 10 as compared to CCA (Section 2) and its improvements, KCCA 
(Section 3) and IKCCA. 

As we were primarily interested in comparing our results against those 
of Oloff et al. (2006) we did not look into other performance metrics other 
than mean rank. However, it would be interesting to pursue other, poten- 
tially more relevant measures of binding affinity such as Kd, Ki and IC50 
as was done by Witten and Tibshirani (2009), where CCA is linked to these 
performance measures. 

While not discussed in this paper, an important unresolved issue in this 
cheminformatic-based approach to the prediction of protein-ligand bind- 
ing is the selection of meaningful chemical descriptors. The type of chemi- 
cal descriptors used can have a drastic effect on the predictive accuracy of 
an algorithm. One possible approach to addressing this issue would be to 
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use a recently developed method called Sparse CCA (SCCA), Hardoon and 
Shawe- Taylor (2008), Parkhomenko (2008), Witten, Tibshirani and Hastie 
(2009) and Witten and Tibshirani (2009). SCCA uses a lasso-like approach 
to identify sparse linear combinations of two sets of variables that are highly 
correlated with each other. An approach based on SCCA to the prediction 
of protein-ligand binding may prove to be quite useful in resolving some of 
the issues arising from chemical descriptor selection. 

In Section 1.1 we present results and details on the RLP800 data set as 
well as on two additional data sets. In Sections 2 and 3 we outline CCA 
and KCCA, respectively. In Section 4 we propose a new method, IKCCA, 
which encompasses nonpositive semi-definite (PSD) kernels (i.e., indefinite 
kernels), specifically we consider a class of kernels related to the Normalized 
Graph Laplacian used in Spectral Clustering. Finally, in Section 5 we show 
how prediction of a new ligand is done using CCA (and its variants). 

1.1. Additional drug discovery results. In addition to the real data results 
discussed in Section 1, we also tested our method on two additional data sets 
[which we refer to as Experimental Settings (ES), the reason for which will 
become clearer in what follows]. These data (including the RLP800 data) 
are subsets of a collection of 1,300 complexes taken from PDBBind [Wang 
et al. (2004)]. These 1,300 complexes are referred to as the Refined Set (RS), 
a set of entries that meet a defined set of criteria regarding crystal structure 
quality. A representative subsample of 195 of the complexes is called the 
Core Set (CS). This is a collection of complexes selected by clustering the 
RS into 65 groups using protein sequence similarity and retaining only 3 
complexes from each cluster. 

The three experimental settings considered are denoted by ES I [this ex- 
perimental setting was used in Oloff et al. (2006)], ES II and ES III. In 
each of these experimental settings the RS and CS complexes are separated 
into training and testing sets in such a way as to test different aspects of 
our model. In ES I the 637 training and 163 test complexes are randomly 
sampled from the RS. ES I is meant to provide a general test of our models 
performance. In ES II the training (153 complexes) and testing (36 com- 
plexes) sets are sampled from the CS in such a way that the various protein 
families in the CS are well represented in both. This separation is meant to 
test the performance of our CCA-based methods when the sample size is 
small. Finally, in ES III the testing set (162 complexes) is composed of pro- 
teins which are under represented in the training set (1,006 complexes). This 
is meant to test our methods ability to correctly identify novel complexes. 

A note on how we use the training and testing sets: the tuning parameters 
for our model are selected, as discussed in Section 5.2, using only the training 
set. Once tuning parameters have been selected, prediction on the testing set 
is then performed. This is meant to test the models performance on as-yet 
unobserved complexes. 
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Table 1 

This table summarizes the performance of the method discussed in Oloff et al. (2006) as 

well as the methods developed in this paper on the CoLiBRI, Name and Cluster data 
sets. The columns labeled "Train" and "Test" correspond to the number of training and 
testing samples for a given data set. The column labeled "Embed" corresponds to the total 
number of ligands against which our prediction is to be ranked. The remaining columns 
correspond to the method used and the average rank performance of that method. Note 
that as the method used in Oloff et al. (2006) failed to provide useful results for the 
Name and Cluster data sets, no results were reported. In all cases IKCCA, using the 
NGL kernel, outperformed the other methods. In all cases the CCA-based methods 
provide considerable improvement over the previous approach 
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NA 
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4.4 
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54,120 


NA 


53 


24.3 


18.2 



The results for each of these experimental settings is summarized in Ta- 
ble 1. The columns labeled "Train" and "Test" correspond to the size of the 
training/testing sets for each particular experimental setting. The column 
labeled "Embed" corresponds to the total number of ligands against which 
our prediction is to be ranked. The remaining columns correspond to the 
method used and the average rank performance (defined in Section 1) of that 
method. The second row, second column in each cell labeled "RS + WDI" 
shows the results for each method on the Reduced Set plus the World Drug 
Index. These results are meant to more accurately mimic an actual drug 
screen by having a larger test set to search against. As the method used 
in Oloff et al. (2006) failed to provide useful results for the ES II and ES 
III experimental settings, no results are reported here. Generally speaking, 
in all cases IKCCA, using the NGL kernel, outperformed the other meth- 
ods. All the CCA-based methods provide considerable improvement over the 
previous approach. 

Looking a bit closer at the results it is interesting to note that while all 
three CCA-based methods performed worse on the ES II data, KCCA had 
the largest drop in performance. This can be seen by comparing the aver- 
age rank performance against the total number of ligands we are searching 
against. The decrease in performance in all cases more than likely has to do 
with the small size of the training set. In the case of KCCA, its considerable 
decrease in performance, we suspect, may have to do with not having a large 
enough training sample to reliably select the bandwidth parameter a. For 
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IKCCA the adaptive nature of the local kernel is probably what allows it to 
perform well in the low sample size setting. 

2. Canonical correlation analysis. CCA [Hotelling (1936)] naturally lends 
itself to the problem of predicting the binding between proteins and ligands. 
This can be understood for the following reasons: first, traditional meth- 
ods of prediction, for example, regression, assume a direction of dependence 
between the variables to be predicted and the predictive variables. Here 
we have a symmetric, not causal, type of relationship: the binding between 
a protein and its ligand is inherently co-dependent. Second, in addition to 
capturing the dependence structure we are looking to model, CCA is well 
suited to the type of prediction we are interested in performing. To under- 
stand this, consider the following (see also Section 2.2.1 for a more detailed 
discussion). The objective of CCA is to find directions in one space, and di- 
rections in a second space such that the correlation between the projections 
of these spaces onto their respective directions is maximized. These direc- 
tions are commonly referred to as canonical vectors. Let us assume that 
a set of directions are found so that the corresponding projections of pro- 
teins and of ligands are strongly correlated. Predicting a new ligand given 
a new protein would begin with projecting the new protein into canonical 
correlation space. Then, assuming the same correlation structure holds for 
this new point, prediction of the new ligand would amount to interpolating 
its location in ligand space based on the location of the protein in protein 
space. This will be discussed in greater detail in Section 5. Next we provide 
a brief discussion on the details of CCA and KCCA. 

2.1. Canonical correlations. Let x, G M. dx and y,; G R dy , i = l,...,n, 
denote a protein-ligand pair. The sample of pairs is collected in matrices 
X G M. nxdx and Y G M nxdy with x, and y, as the descriptors for a row. 

The objective of CCA is to find the linear combinations of the columns 
of X (proteins), say Xwx and the linear combinations of the columns of Y 
(ligands), say Ywy such that the correlation, corr(Xwx, Ywy) is maxi- 
mized. Without loss of generality assume that the matrices X and Y have 
been mean centered. Letting Sxx — X r X, Syy = Y r Y and Sxy = X r Y 
the CCA optimization problem is 

p= max corr(Xwjf, Ywy) 

(2.1) subject to, 

w |SlX w X = WySyyWy = 1. 

Subsequent directions are found by imposing the additional constraints 
(w^) x Sxiw^ = (wy) T SyyWy = (wj) T SxyWy = for i ^ j and 
(w x ) T Sxxw x = (w y ) T Syyw y = 1, i,j = 1,. . . ,p, p = mm(d x ,d Y )- 
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In order to avoid issues arising from multicollinearity and singularity of 
the covariance matrices we impose a penalty [Vinod (1976)] on the direc- 
tions wx and wy so that the constraints in (2.1) are modified to be 

rji rr~\ rji rri 

(2.2) WjSjcjWjc + KWyWj = WySyyWy + KWyWy = 1, 

where ft G K is a regularization parameter. 

The predictive accuracy of this approach was discussed in Section 1, with 
results summarized in Figure 1. Recall that the lines in these figures labeled 
CCA correspond to the average predicted rank using CCA, which improved 
upon Oloff et al. (2006) shown by the lines labeled a such. 

2.1.1. The geometry of CCA. An appealing aspect of CCA is its intuitive 
geometric interpretation [Anderson (2003) and Kuss and Graepel (2002)]. 
A geometric perspective lends itself to a better understanding of the general 
behavior of CCA, and provides further evidence of its applicability to the 
protein-ligand matching problem. 

Taking a closer look at the ith canonical correlation, pi, i = 1, . . . ,p (p = 
mm(dx , dy)) , in the optimization problem shown in (2.1), it can be seen that 
this quantity is in fact equal to the cosine of the angle between & x = Xw l x 
and a y = Ywy (a^- and ay are commonly referred to as canonical variates). 
With this in mind maximizing the cosine (i.e., correlation) can equivalently 
be thought of as minimizing the angle between sl x and ay. Furthermore, 
it can be shown that minimizing the angle is equivalent to minimizing the 
distance between pairs of canonical variates, 

min ||Xw^ — Ywy || 2 

w x ,w i' 

subject to the constraints described in (2.1). Note that viewed in this way, in 
canonical correlation space, this amounts to finding a system of coordinates 
such that the distance between coordinates is minimized. This is a sense 
in which CCA is an appropriate approach to the protein-ligand matching 
problem. 

As will be seen in Sections 3 and 4, this geometric interpretation of CCA 
extends naturally to KCCA and IKCCA. Note that the regularized variant of 
CCA does not have the same geometric interpretation, nonetheless viewing 
regularized CCA in this manner still provides useful insight into its behavior. 

2.2. Toy examples. 

2.2.1. Toy example 1: Motivating CCA. Consider the protein-ligand 
matching problem as outlined above. For this toy example we set n = 10 
and d = 2. Suppose the descriptors for this toy example are Molecular 
Weight (MW) and Surface Area (SA) of the molecule. Recall that each row 
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Fig. 2. Toy example data. The points highlighted in red correspond to the protein-ligand 
pair llgs, and the points connected to it by dashed black lines are its three nearest neighbors 
in each space. The observations highlighted in cyan are neighbors in both spaces, and 
those highlighted in blue and purple are neighbors only in the protein, and ligand spaces, 
respectively. The green point L new in the ligand space corresponds to a weighted average 
(discussed in Section 5) of the cyan points and the purple point, that is, of the nearest 
neighbors of llgs in the protein space. 



of X(iox2) an d each row of Y( 10X 2) corresponds to an observation, a pro- 
tein or a ligand, respectively, and the columns correspond to the descriptors 
MW and SA. The pairs are identified by a unique label, corresponding to 
IDs from the Protein Data Bank (PDB) (www.pdb.org). Figure 2 shows the 
two toy data sets. 

From Figure 2 it can be seen that the distribution of points in the two 
spaces are quite similar in the sense that the location of corresponding points 
in the two spaces are close. The points connected to llgs (red) by dashed 
black lines are its three nearest neighbors. The cyan points are neighbors 
shared in both spaces and the blue and purple points are mismatched. Two 
of three neighbors are shared in common (in the Euclidean sense). 

Consider the case where the red point in ligand space is not observed and 
the task is to predict its value. Using the weighted average (see Section 5 
for details on the derivation of the weights) of the points in ligand space 
that correspond to the nearest neighbors of the point llgs in the protein 
space (points highlighted in cyan and purple in ligand space) would yield 
a relatively poor prediction despite the strong apparent similarity between 
the two distributions of points. 

Next suppose that instead of carrying out the prediction of a new ligand in 
the original data space we carry out our prediction in canonical correlation 
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Fig. 3. The direction vectors and the projected value of each point. The top row of plots 
shows the first direction vector and the projections onto it. The bottom row of plots show 
the second direction vector and the projection onto it. 



space. Solving for wx and wy in (2.1), gives us the canonical vectors shown 
in Figure 3. What is important to notice is how the distribution of points 
along the first and second canonical directions in both protein and ligand 
space are quite similar. This is due to the property of alignment that arises 
naturally from maximizing the correlation. 

Figure 4 shows the projections of the data onto the first two canonical 
vectors (note that separate directions are found in protein and ligand space) . 
We can see that with the slight modification in alignment that has resulted 
from the CCA projections, the point llgs now shares the same neighbors in 
both spaces. In particular note that now the predicted value in the projected 
ligand space is closer to the actual value (again using the weighted average) . 
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Fig. 4. Projection of the data in Figure 2 onto the first and second canonical vectors. In 
contrast to Figure 2. the point llgs now shares the same neighbors in both spaces and the 
predicted value in green is much closer to the actual value. 



This example was deliberately chosen to illustrate the case where CCA is 
effective. However, in most cases the relationship between points in different 
spaces may be far more complicated, as we now illustrate. 

2.2.2. Toy example 2: CCA challenge. We now consider an example 
where the relationship between spaces is more complex. Suppose that we 
have the same general framework as in Section 2.2.1 but rather than hav- 
ing both protein and ligand space characterized by MW and SA, we now 
have that the space of proteins has descriptors d x and d x and that the 
space of ligands has descriptors d Y and d Y , shown in Figure 5. As before 
the observation highlighted in red, la94, corresponds to a new protein whose 
corresponding ligand we are trying to predict. The point highlighted in cyan 
is one of the 3-nearest neighbors of la94 in both spaces. Those points high- 
lighted in purple (and blue) are nearest neighbors in only the protein (and 
ligand) spaces, respectively. The point L ncw in the ligand space, highlighted 
in green is a weighted average of the nearest neighbors of the point la94 
in protein space. Using L ncw as a prediction of the new ligand would not 
provide a particularly accurate prediction. 

As before, we use CCA to try and find a linear combination of the descrip- 
tors which best align the two spaces. Figure 6 is a plot of the projections 
onto the first and second canonical variates in protein and ligand space. The 
color scheme is the same as in Figure 5. As can be seen, standard CCA does 
not seem to be able to find a good alignment between the two spaces, which 
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Fig. 5. A plot of the data generated such that the underlying relationship between points 
is nonlinear. The observation highlighted in red, la94, is the new observation which we are 
trying to predict. The points joined to it by dashed black lines are its nearest neighbors. 
The points highlighted in cyan correspond to points that are nearest neighbors of la94 
in both spaces. Points highlighted in purple and blue correspond to points that are only 
neighbors in either protein or ligand space, respectively. The point labeled £ now m ligand 
space corresponds to a weighted average of the points la08, la09 and lalb, that is, the 
nearest neighbors of the point la94 in protein space. 



is confirmed by the relatively low values of the canonical correlations, 0.79 
and 0.54, respectively, for the first and second directions. 

In Section 3 we show how mappings into a kernel induced feature space 
can be used to improve prediction. This will lead to our discussion of KCCA. 

3. Kernel canonical correlation. 

3.1. Toy example 2: CCA challenge (motivating KCCA). Returning to 
the example in Section 2.2.2, suppose it is believed that some type of func- 
tional relationship exists between the descriptors across spaces that is best 
characterized by looking at the second order polynomials of the descriptors 
within each space, that is, 

$x:(4,4)^((4) 2 ,(4) 2 ,44), 

(3.1) 

$y:(4,4)^((4) 2 ,(4) 2 ,44). 

Figure 7 shows plots of proteins and ligands embedded into this three di- 
mensional space. As can be seen there are now two neighbors shared in 
common between spaces (colored in cyan). Furthermore the prediction of 
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Fig. 6. A plot of the data projected onto the first two canonical vectors in both protein 
and ligand spaces. The directions found by standard CCA do not provide a good alignment 
between the two spaces. 



the new observation, L ncw (in green) by a weighted average of its three 
nearest neighbors in feature space is, by comparison, much closer to the 
actual value than the corresponding prediction in object space. 

As before CCA is used on this transformed data, now in feature space, to 
align the space of proteins and ligands. Figure 8 shows a plot of the projected 
data. Note that now both the new protein and its ligand (highlighted in red) 
share three neighbors and that the distribution of points within each of the 
spaces is quite similar. The quality of the alignment is further confirmed 
by looking at the canonical correlation values which are near 1 for each of 
the first two directions. Since the value of the third canonical correlation is 
considerably smaller (approximately 0.2) we only project onto the first two 
directions. 

It is worth noting that, as a result of overfitting, the kernel canonical 
correlation values can sometimes be artificially large due to strong correla- 
tion between features in kernel space. Regularization methods for helping to 
control these effects in the kernel case will be discussed in Section 3.2. 

In general, finding explicit mappings such as those in (3.1) is impractical 
or simply not possible as in some cases this would require an infinite dimen- 
sional feature space. As we will see in the following section, kernels allow us 
to avoid such issues. 

3.2. Kernel canonical correlation analysis. KCCA [Bach and Jordan 
(2002), Lai and Fyfe (2000), Hardoon, Szedmak and Shawe-Taylor (2004)] 
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Fig. 7. j4 p/ot of protein and ligand data in feature space. The color scheme is the same 
as in Figure 5. Looking at the plots on the top and bottom ( corresponding to protein and 
ligand space, respectively), the overall correspondence between points in protein space and 
ligand space is much better than in the original (object) space. This improved mapping will 
allow CCA to do a better job aligning the two spaces. 



extends CCA by finding directions of maximum correlation in a kernel in- 
duced feature space. Let cpx and (fty be the feature space maps for proteins 
and ligands, respectively. The sample of pairs, now mapped into feature 
space, are collected in matrices $x and <I>y with 0x( x i) and ^y(yi) as their 
respective row elements. The objective, as before, is to find linear combina- 
tions, ^jjfWjjf and $ywy such that the correlation, corr($xWjjf , $ywy), is 
maximized. Note that because wx and wy lie in the span of &x and $y, 
these can be re-expressed by the linear transformations wx = $xajf and 
wy = $yay. Letting Kx = &x&x an< ^ ^ Y = &Y&y w ith kx and Ay be- 
ing the associated kernel functions for each space, respectively, the CCA 
optimization problem in (2.1) now becomes 

PH= m ax corr($xwjf, <£ywy) = max corr(Kxax, Kyay ) 

wx,wy ax ,fty 

(3.2) subject to 

w x^x^x w x = «xKjax = 1 and wy<£y<3?ywy = OyKyay = 1. 
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Fig. 8. This is a plot of the projection of the data in protein and ligand feature space onto 
the first and second canonical vectors ( note that we do not project onto the third canonical 
vector as the corresponding canonical correlation is relatively small, approximately 0.2). As 
can be seen, not only does the new observation la94 (red) have three neighbors in common 
in both protein and ligand space but the prediction of the new ligand using a weighted 
average, _L now highlighted in green on the plot on the right is close to the actual value of 
la94. 



Here the subscript H in pu is included to emphasize the fact that the space 
of functions we are considering are in a RKHS. Subsequent directions are 
found by including the additional constraints that a^K^cr^- = «y Ky«y = 

a*Y KjfKyOy = for i / j, and a l ^K. x a x = ayYJyay = 1, i,j = l,...,n. 

In order to avoid trivial solutions, we penalize the directions ax and ay 
modifying the constraints in (3.2) to be 

(3.3) a x K x ai + Ka x otx = a Y ^-y a Y + Kaycxy = 1- 

Here ft is a regularization parameter. 

Note that the geometric interpretation of (unregularized) KCCA, pro- 
vided that data have been centered in feature space, is the same as CCA. 
The only difference lies in the fact that the space in which this geometry is 
observed is in feature space rather than object space. 

In order for KCCA to be understood as maximizing correlation in feature 
space centering must be performed in feature space. Centering in feature 
space can be done as follows. Let $ = - J<3? where J is an n x n matrix of 
ones, then 

(<£ -<£)(<]?- <5) T = - <&<i> T - <S<I> T + l>$ T 

= K - - JK - ^KJ + 4r JKJ 

n n n A 
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Fig. 9. A toy example illustrating the cases when the distribution of points within a space 
is nonstandard and heterogeneous. 

-IjWi-Ij 

n J \ n 

We assume throughout that the kernel matrices are centered. 

The predictive accuracy of this approach was discussed in Section 1, with 
results summarized in Figure 1. Recall that the cyan line in Figure 1 corre- 
sponds to the average predicted rank using KCCA which is an improvement 
over both Oloff et al. (2006) and CCA. 



3.3. Toy example 3: KCCA challenge. We saw in Section 3.1 that KCCA 
was able to overcome some of the obstacles encountered by standard CCA. 
Where KCCA begins to encounter problems is when the distribution of 
points within a space is nonstandard and/or heterogeneous. To illustrate 
this consider the example shown in Figure 9, as with the protein-ligand 
matching problem, there is a one-to-one correspondence between points in 
the two spaces. 

The underlying structure between these spaces is illustrated in Figure 10. 
The top row of plots tells us about how the distribution of points on the 
right (cluster space) relates to the distribution of points on the left (smiley 
face space). The bottom set of plots tells us about how the distribution of 
points on the left is related to distribution of points on the right. 

If we were to look at the two spaces as marginal distributions, there is 
a distinct impression of the three clusters in the left, and two in the right. 
The joint distribution, however, has six distinct groups. Looking at the plots 
on the left in Figure 10, each of the three clusters is in fact composed of 
two subclusters. Likewise, each of the two clusters in the plots on the right 
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Fig. 10. These plots highlight how the distribution of points in one space is related to the 
distribution of points in the other. Looking at the plots on the left in Figure 10, each of 
the three clusters is in fact composed of two subclusters. Likewise, each of the two clusters 
in the plots on the right are composed of three subclusters. 

are composed of three subclusters. Ideally, the projections onto the KCCA 
directions would identify each of these six groups, shown in Figure 11. 

Using an RBF kernel with a = 1/2 we look at the first five canonical 
directions. Ideally, what we would see is a separation of each of the groups 
as well as a strong alignment between each of the spaces. What we find 
looking at Figure 12, a scatter plot matrix of the first five kernel canonical 
variates (KCV), is that while the leading correlations are large (0.98, 0.97, 
0.95, 0.80, 0.75), we are not able to find the structure in the data we were 
looking for, that is, separating out the six groups (with each of the colors 
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Fig. 11. In this plot each of the six underlying subgroups shown in Figure 10 is high- 
lighted. 

corresponding to one of the six groups). Note that only the projections in 
the smiley face space are shown since the cluster space projections look 
essentially the same. 

In the context of the protein-ligand matching problem this type of situ- 
ation presents a potential problem. Suppose a new point, say in the space 
with the smiley face, is projected into KCCA space. As can be seen in Fig- 
ure 12, there is a great deal of overlap between each of the six subgroups in 
the projected space. In particular note that each of the overlapped groups 
is composed of, respectively, the left eye, right eye and mouth. The reason 
this type of behavior presents a problem is that each of the eyes and the 
mouth are actually composed of two different subpopulations where each of 
the populations correspond to very different groups in the space with the 
two clusters. So while we may be able to accurately predict the location of 
a new point in KCCA space the interpretation of its surrounding neighbors 
may not be so meaningful. 

4. Indefinite kernel canonical correlation analysis. A potential short- 
coming of standard KCCA, which was illustrated in the example presented 
in Figure 9, is that standard positive definite kernels can be limited in their 
ability to capture nonstandard heterogeneous behavior in the data. A more 
general class of kernels which is better suited to handle this type of behavior 
takes the form 
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Fig. 12. Scatterplot matrix of the first five KCCA direction vectors for the data shown 
in Figure 9. Each of the colors in this plot corresponds to one of the six underlying sub- 
population in the data (see Figure 10 for details). 



Here iV(x) denotes some neighborhood of the observation x, such as a fc(E 
Z+) nearest neighborhood or a fixed radius e(> 0)-neighborhood. Kernels of 
this form restrict attention to the local structure of the data and allow for 
a flexible definition of similarity. 

Our motivation for considering this class of kernels in the context of the 
protein-ligand matching problem is the following. In the RLP800 dataset 
there are approximately 150 important subgroups in the data. These sub- 
groups correspond to unique proteins, or more specifically their binding 
pockets, which typically have three or four different conformations specific 
to a particular ligand. Exploitation of this group structure in the data can 
help improve prediction. This can be accomplished by using a "local ker- 
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nel" function that allows us to capture these groups more readily than, say, 
the RBF kernel. The intuition here follows from the example presented in 
Section 3.3 where we saw that the type of groups that an RBF kernel will 
be able to find will be dictated by the choice of the bandwidth parame- 
ter a. The local kernel overcomes this by adjusting locally to the data. By 
adjusting to the data locally it is better able to exploit this group struc- 
ture. 

In summary, given a new protein, its projection in this local kernel CCA 
space will be more likely to fall into a group of similar proteins. Then, as 
before, the goal is that the ligands associated with this group of proteins 
provide an accurate representation of the ligand we are trying to predict. 

This improved performance exploiting group structure in the data comes 
at some price. In particular, the problem encountered with this class of 
kernels is that they are frequently indefinite (see the discussion following 
Definition 4.1). As a result of the indefiniteness, many of the properties and 
optimality guarantees no longer hold. 

Indefinite kernels have recently gained increased interest [Ong, Canu and 
Smola (2004a), Haasdonk (2005), Chen and Ye (2008), Luss and d'Aspremont 
(2008)] where, rather than defining K to be a function defined in a RKHS, 
K is defined in a space characterized by an indefinite inner product called 
a Krein space. In Section 4.1 we provide an overview of some of the defini- 
tions and theoretical results about Krein spaces [following the discussion of 
Ong, Canu and Smola (2004a)]. 

Before discussing IKCCA, we will need to provide some definitions and 
theorems related to indefinite inner product spaces, that is, Krein spaces 
[more details can be found in Ong, Canu and Smola (2004a)] . 

4.1. Indefinite kernels. 

Definition 4.1 (Inner product). Let K, be a vector space on the scalar 
field. An inner product (•, -)/c on 1C is a bilinear form where for all /, g, h £ fC, 

• (f,g)x = (gJ)ic; 

• {af + g,h)ic = a(f ) h)]c + (g,h})c; 

• (/> g)lC = for all g £ K implies =4> / = 0. 

The importance of fC being a vector space on a scalar field is that it allows 
for a flexible definition of an inner product (i.e., the scalar in one of the 
dimensions could be complex or negative as we will see below). An inner 
product is said to be positive if for all / € /C, (/, f)/c > 0. It is called a neg- 
ative inner product, if for all / € /C, (/, f)jc < 0. An inner product is called 
indefinite if it is neither strictly positive nor strictly negative. 
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Remark 4.1. To illustrate how indefinite inner products arise in the 
context of our problem, consider the following. Suppose we have a symmet- 
ric kernel function K, which is indefinite, the implication of this is that 
the resulting kernel matrix K = {Kij}™ =1 is indefinite and that it therefore 
contains positive and negative eigenvalues. Let K = UAU T be the eigen- 
decomposition of K, where U are the eigenvectors and A is the diagonal 
matrix of eigenvalues starting with the p positive eigenvalues, followed by 
the q negative ones and the n — p — q eigenvalues equal to 0. To see how K 
can be interpreted as a matrix composed of inner products in this indefinite 
inner product space consider the following representation of its eigendecom- 
position: 

K = UlAj 1 / 2 diag(l p , -l q , n _ p _ 9 )|A| 1 / 2 U T '. 

Let M = diag(l p , — l q ) and 3? be equal to the first p + q columns of U|A| 1//2 . 
Define the ith row of 3> to be equal to 

$i = • ■ • j^ijjj^ijj+lj • • • i<i>i,p+q)- 

V v ' V v ' 

=$+ =<s>- 
We then have a kernel matrix composed of elements 

= (* f +) r *t - ($rf 

(4.2) 

= <$i,$i)« + - (^i,^j) H - 

From (4.2) we can see that unlike PSD kernels where a T Ka > for any 
a€lR n , with indefinite kernels a T Ka can take on any value, making opti- 
mization over such a quantity challenging. 

Despite this difference, many of the properties that hold for reproducing 
kernel Hilbert spaces (RKHS), such as (and perhaps most importantly) the 
reproducing property [Scholkopf and Smola (2002)], also hold for these in- 
definite inner product spaces [see Ong, Canu and Smola (2004b) for details]. 
The key difference lies in the fact that rather than minimizing (maximizing) 
a regularized risk functional, as in the RKHS setting, the corresponding op- 
timization problem becomes that of finding a stationary point of a similar 
risk functional. 

4.2. Indefinite kernel canonical correlation. Section 4.1 provided some 
insight into the challenges that arise from dealing with indefinite kernels. In 
particular, Remark 4.1 points to the fact that the solution that we find may 
not be globally, or even locally, optimal (as it may be a saddle point). The 
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form of the IKCCA problem we present in this section is motivated by the 
discussion of the previous section and the works of Ong, Canu and Smola 
(2004b) and Luss and d'Aspremont (2008). In particular, the addition of 
a stabilizing function on the indefinite inner product (/, f)fc as discussed in 
Ong, Canu and Smola (2004b) led us to consider introducing a constraint 
on the behavior on the indefinite kernels matrix itself. 

In the following, let || ■ \\p denote the Frobenius norm. Define M y to 
mean that the matrix M is positive semi-definite and let Ax, Ay G M + U oo 
be tuning parameters (discussed in more detail later this section). Here K. x 
and Ky are the (potentially) indefinite kernels and Kx and Ky will be the 
positive semi-definite approximations of these kernels. With this notation in 
mind, we now define the IKCCA optimization problem: 

p H = max min Tr(A^KjKyAy) + Ax||Kx - K° X \\ 2 F 

+ Ay ||Ky — Ky||fi, 

subject to 

A X K 2 X A X + kA x A x = I n , 

(4.3) AyKyAy + JtAyAy = I n , 

(a x ) T KxKya y = for j,i,j = 1, . . . ,n, 
Ky >r 0, 

where Ax = (ct x , . . . , a x ) and Ay = (a y , . . . , a y ). Note that this optimiza- 
tion problem and the KCCA optimization problem are only equivalent when 
the kernel matrices K. x and Ky are positive semi-definite (see the Supple- 
mentary Material [Samarov et al. (2011)] for details on the equivalency be- 
tween the optimization problem in (4.3) and (3.2) and a proof of Theorem 
4.3). 

Theorem 4.2. Letting Ax, Ay — > oo, the optimization problem in (4.3) 
is concave in a x and a l Y , i = 1, . . . ,n, and convex in Kx and Ky . 

See the Supplementary Material for a proof. Let (X) + denote the positive 
part of the matrix X, that is, (X) + = max(0, Aj)vjV?\ where Aj and Vj 
are zth eigenvalue-eigenvector pair of the matrix X. With this in mind, we 
have the following theorem. 

Theorem 4.3. Letting Ax, Ay — > oo, and given the optimization problem 
in (4.3) the optimal values for Kx and Ky are given by 

Kx = (K x ) + , 

4.4 

Ky = (K y ) + . 
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The proof of Theorem 4.3 makes use of the following lemma. Let Mo € 
W ixn be a known, square, not necessarily positive-definite matrix, and M € 
jgmxn a S q Uar6) un known matrix, then: 

Lemma 4.1. The solution to the optimization problem 

arg min ||M — Mollp, 

is 

M = (M )+. 

The proofs of Theorem 4.3 and Lemma 4.1 can be found in the Supplemen- 
tary Material. 

Points x 6 M. dx and y £ M. dY are projected onto their first p canonical 
directions as follows: first compute their kernelization, using the indefinite 
kernel functions K x and K Y , 

K° x (x, •) = (K° x (x, xi) , . . . , K° x (x, x n )) T , 
K Y (x,-) = (K Y (y,y 1 ),...,K Y (y,y n )) T . 

Then calculate 

K Y (y,-)A p Y , 

where A^- = (a x , . . . , a x ) and Ay = (a Y , . . . , a Y ) . 

4.3. Toy example 3: KCCA challenge (motivating IKCCA). We now re- 
turn to the example in Section 3.3 using the kernel defined in (4.1) with 
weights (4.5). Note that this kernel is closely related to the Normalized 
Graph Laplacian (NGL) kernel used in Spectral Clustering; see von Luxburg 
(2007) for an overview of Spectral Clustering methods. From Figure 13, it 
can be seen that we are now able to capture the underlying structure of the 
data, identifying each of the six subpopulations: 

expj-^— ||xj -Xj|| 2 j, if Xj eN k (xi), 

0, otherwise, 
and 

(4.5) Wi 



"'■.iv llj'" 



Here TVjfc(xj) is the symmetric /c-neighborhood of the point Xj [i.e., if Xj € 
N k (xi) then x« £ N k (xj)] and 



a ij ~ ll x « x i II ll x j x j I 
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Fig. 13. Continuation from the example in Section 4-3. This is a scatter plot matrix of 
the projections onto the first five IKCCA variates (IKCV) using the kernel in (4-1) with 
weights (4-5). Unlike the projections shown in Figure 12, here we are able to separate out 
the six groups. 



where is the feth neighbor of the point Xj. 

Looking at plots of the first four eigenvectors (Figures 14 and 15) in both 
the smiley face space and the cluster space, we can see how the behavior 
of the eigenvectors causes the segmentation of the data that we observe in 
Figure 13. First, we discuss how these figures are generated and then what 
it is they are telling us. 

(1) Generate an equally spaced dimensional grid spanning the range of 
values in each space. 

(2) Calculate the kernel representation and projection of each grid point 
into IKCCA space. 
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Fig. 14. A plot of the first four indefinite kernel canonical direction vectors in the smiley 
face space from the example in Section 4-3 using the kernel in (4-1) with weights (4-5). 
These plots allow us to visualize how the canonical vectors separate out each of the clusters. 

(3) Use the projected values to assign color intensities to each point in 
the grid of each space (darker for negative values, lighter for positive values). 

(4) Plot the grid and for each point using the colors calculated from the 
previous step. 

The important thing to note in both of these figures is the distribution 
of positive and negative projected values and how these are driving the 
segmentation, which we observe in Figure 13. For example, in Figure 14 the 
first canonical variate segments out one of the faces (red) from the other 
(blue). 

5. Ligand prediction. 

5.1. Prediction. Let us define the projected values of the observations 
in protein and ligand space onto their first p canonical vectors as = 
(w^, . . . , w^) T Xi , x£ p e W, and y£, = (w\r, . . . , w p /) T yi , y£, € W, 'i = 
1, ... ,n. The predicted value of ynew,p is calculated as follows [using a mod- 
ification of the LLE algorithm of Saul and Roweis (2003)]: 

(1) Compute the k neighbors of the data point x™ ew (the projected value 
of x new into canonical correlation space). Define iVfc(x) to be the k nearest 
neighbors of the point x. 



26 



D. SAMAROV ET AL. 



Second KCV 




Third KCV 





f' 3 



1 . * 



Fourth KCV 



1' 



Fig. 15. A p£ot o/ t/ie /irst /our indefinite kernel canonical directions vectors in the cluster 
space from the example in Section 4-3 using the kernel in (4-1) with weights (4-5). 



(2) Compute weights f3 new ,• that best reconstruct the data point x 



from its neighbors, minimizing the cost 



new,p 



(5.1) 



subject to 



x 



j:x 3 -eiV fe ( X » Wj3 ,) 



(3) The new observation is then calculated as 

Jnew,p ~ / j Pnewjy 



w 



Recall that CCA finds directions which best align two spaces. Thus, as- 
suming that directions w^- and Wy, i = 1, . . . ,p, have been found such that 
the correlation between spaces is strong, using the weights /3 new j found, in 
protein space should provide a reliable estimate of y™ ewp - 
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5.2. Tuning parameter selection. Values for the tuning parameters, k = 
kx = k-y (the regularization parameter), p = px = Py (the number of dimen- 
sions we are projecting into), /clle (the neighborhood for the LLE-based 
prediction), a (for the RBF kernel) and /cngl (for the NGL kernel) are 
found by searching over a suitable 3x3x3x3 grid for each. The final set 
of parameters are selected based on which produces the lowest average rank 
(discussed in Section 1). 

SUPPLEMENTARY MATERIAL 

Local kernel canonical correlation analysis with application to virtual 
drug screening (DOI: 10.1214/11-AOAS472SUPP; .pdf). Drug discovery 
is the process of identifying compounds which have potentially meaning- 
ful biological activity. A major challenge that arises is that the number 
of compounds to search over can be quite large, sometimes numbering in 
the millions, making experimental testing intractable. For this reason com- 
putational methods are employed to filter out those compounds which do 
not exhibit strong biological activity. This filtering step, also called virtual 
screening reduces the search space, allowing for the remaining compounds 
to be experimentally tested. 

In this paper we propose several novel approaches to the problem of virtual 
screening based on Canonical Correlation Analysis (CCA) and on a kernel 
based extension. Spectral learning ideas motivate our proposed new method 
called Indefinite Kernel CCA (IKCCA). We show the strong performance of 
this approach both for a toy problem as well as using real world data with 
dramatic improvements in predictive accuracy of virtual screening over an 
existing methodology. 
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